Analyst's Workspace (AW) is a spatially oriented sensemaking tool, developed at Virginia Tech over the past few years, that is designed explicitly for large, high-resolution displays. The tool provides an environment in which foraging and evidence marshaling can be integrated. The large display allows AW to use full-text documents as fundamental artifacts, which the analyst can move around the space like physical sheets of paper. The use of real physical space permits the analyst to rapidly access information through physical navigation (eye, head, and body movement) while maintaining a strong sense of the layout of the space.
Exploration is facilitated through named entities, which are visually identified within the documents. Expanded entities coexist in the space with the documents, providing access to other documents containing the entity, direct visual connections to occurrences in all open documents, links to co-occurring entities, and useful artifacts for evidence marshaling at a different level of abstraction from raw documents.
In addition to document-entity exploration, AW supports document-document and entity-entity exploration through the use of graph-traversing "storytelling" algorithms based on data mining of the text. The analyst draws a link between two documents or two entities and AW provides connecting material. A neighborhood tool can uncover "similar" documents based on a vector-space model of the text.
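The idea of drawing a link between two entities and receiving connecting material can be illustrated with a small sketch. The storytelling algorithms themselves are not described here, so the code below substitutes a plain breadth-first search over an entity graph; the function name `connect` and the toy graph are hypothetical and are not part of AW.

```python
from collections import deque

def connect(graph, start, goal):
    """Return a shortest chain of nodes from start to goal, or None if none exists.

    Breadth-first search is used as a stand-in for AW's storytelling
    algorithm (an assumption made for illustration only).
    """
    if start == goal:
        return [start]
    parent = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in parent:
                continue
            parent[neighbor] = node
            if neighbor == goal:
                # Walk the parent links back to the start, then reverse.
                path = [goal]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None

# Hypothetical toy graph; edges stand for "mentioned together".
graph = {
    "food plant": ["food supply"],
    "food supply": ["food plant", "bioterrorism"],
    "bioterrorism": ["food supply", "Professor Patino"],
    "Professor Patino": ["bioterrorism", "lab robbery"],
    "lab robbery": ["Professor Patino"],
}
story = connect(graph, "food plant", "lab robbery")
```

In this sketch the intermediate nodes of `story` play the role of the connecting material; a real implementation would presumably weight edges by the strength of the textual association rather than treat all links equally.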
To prepare the data for analysis, we used a combination of LingPipe, OpenNLP and the Stanford NER to extract entities, and document classification was performed using the AlchemyAPI.
MC 3.1 Potential Threats: Identify any imminent terrorist threats in the Vastopolis metropolitan area. Provide detailed information on the threat or threats (e.g. who, what, where, when, and how) so that officials can conduct counterintelligence activities. Also, provide a list of the evidential documents supporting your answer.
The most immediate threat is the possibility that the food supply has been exposed to a bioweapon by the Paramurders of Chaos (PoC). On 4-26, Professor Patino, a known expert in bioterrorism (03212.txt), had a large quantity of equipment stolen from his laboratory. On 5-13, police apprehended three members of the PoC who had remained behind during the raid to destroy their basement laboratory (03435.txt). While there are no direct connections to Patino's missing equipment, the presence of high-end equipment and Petri dishes is suggestive, and the link should be followed up. Two days later, a supposed member of the PoC was caught trespassing at a local food preparation plant (01878.txt). Given the potential that PoC has been developing bioweapons, we cannot dismiss the possibility that, despite the apprehension, contamination has occurred.
On 5-17, a potentially unrelated accident occurred on the bridge over the Vast River involving a truck carrying food products (01030.txt). This was followed by reports of large numbers of dead fish in the river (01038.txt). We do not know if the truck came from the same plant, or if the die-off is caused by the chemicals carried by the other truck, or some other cause. However, without further investigation, we cannot rule out the possibility that this is evidence that a bioweapon has been deployed, and that it is now loose in the river.
A secondary threat may be posed by the Network of Hate (NoH). On 4-26, a number of weapons were stolen from the Armed Forces compound, including three missiles and twenty rifles (02287.txt). On 5-16, three missiles were recovered during a traffic stop, and the driver had connections to the NoH (02395.txt). Twenty rifles were then recovered by the TSA on 5-20 (00499.txt). Even if all of the armaments have been recovered (for which we have no confirmation), the culprits (other than the truck driver) are still at large, and a group working at this level warrants further investigation. An incident in which large-caliber weapons were fired in a park, occurring four days after the initial robbery (04293.txt), may provide further leads if the casings can be tied back to the stolen ammunition.
Before the data was loaded into AW, it underwent extensive preprocessing. Entities were extracted and then used to build a pair of entity graphs: one based on sentence co-occurrence, built up from a semantic model of the text, and a fallback based on document co-occurrence. A vector-space model of the text was also constructed to facilitate the document-document exploration and document neighbor discovery. We constructed a list of 179 words occurring in the Global Terrorism Database and used it to rank the documents by how frequently those words appeared. Finally, the documents were classified using AlchemyAPI. The entire pipeline took approximately 48 hours.
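Two of the preprocessing steps, the document co-occurrence graph and the word-list ranking, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the three-word list, and the toy documents are hypothetical, the tokenization is a simple regular expression, and the sentence-level graph built from a semantic model is not reproduced.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence_graph(doc_entities):
    """Map each unordered entity pair to the number of documents containing both."""
    edges = Counter()
    for entities in doc_entities.values():
        # sorted(set(...)) gives each pair one canonical ordering per document.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges

def rank_by_wordlist(docs, wordlist):
    """Order document ids by the count of tokens found in the word list, highest first."""
    terms = {w.lower() for w in wordlist}
    scores = {}
    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z']+", text.lower())
        scores[doc_id] = sum(1 for t in tokens if t in terms)
    # Ties are broken by document id so the ordering is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical toy data; the actual list held 179 words.
wordlist = ["bomb", "attack", "hostage"]
docs = {
    "a.txt": "Police found a bomb and a second bomb near the bridge.",
    "b.txt": "The city council approved the budget.",
    "c.txt": "A hostage was released after the attack.",
}
doc_entities = {
    "a.txt": ["Police", "bridge"],
    "c.txt": ["Police", "bridge", "hostage"],
}
ranking = rank_by_wordlist(docs, wordlist)
edges = cooccurrence_graph(doc_entities)
```

Raw counts favor long documents, so a variant that divides by document length may be preferable; the text does not say which form was used.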
Within AW, analysis began by examining the documents classified as "law/crime" (81 documents) and "unknown" (41 documents). Documents were skimmed using the browser's preview function, and interesting ones were opened and read more carefully. Entities that were unidentified or misidentified were corrected, important phrases were highlighted, and key entities were opened next to the document. These were then moved to different locations on the display based on related events, entities, or themes for later follow-up.
The small clusters formed by this process created a number of interesting seeds that were then fleshed out. First, the key entities in the documents were opened and the related articles were added to the cluster. Using the temporal ordering tool, the documents were arranged into temporally ordered columns to illuminate event sequences. Documents were then moved horizontally to represent potentially separate threads of events. Textual search helped to follow up on themes not directly tied to entities.
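The arrangement into temporally ordered columns amounts to grouping documents by date and sorting the groups. The sketch below assumes each document carries a single date; the function name, the document ids, and the year are hypothetical, and AW's own layout logic is not shown.

```python
from collections import defaultdict
from datetime import date

def temporal_columns(doc_dates):
    """Group document ids by date and return the groups in chronological order."""
    columns = defaultdict(list)
    for doc_id, day in doc_dates.items():
        columns[day].append(doc_id)
    # One column per date; ids are sorted within a column for a stable layout.
    return [(day, sorted(columns[day])) for day in sorted(columns)]

# Hypothetical ids and dates (the year is an assumption for illustration).
doc_dates = {
    "doc_c": date(2011, 5, 17),
    "doc_a": date(2011, 5, 13),
    "doc_b": date(2011, 5, 13),
}
columns = temporal_columns(doc_dates)
```

Separating threads of events within a column remains a manual, analyst-driven step in the process described above, so it is deliberately left out of the sketch.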
Figure 2. Entities appear in three forms. Note the use of deep linking to occurrences within the document.
For example, the document about the arrest at the food plant led to searches about risks to the food supply. This led to a document about bioterrorism. A follow-up search on bioterrorism led to Professor Patino and, following that, to information about the robbery at his lab. Inserting this into the timeline for our cluster revealed a potential sequence of events.
Figure 3. Documents, highlighting, entities, and spatial relationships all contribute to the way clusters can visually represent stories.
After following the obvious leads out of a cluster of documents, the neighborhood tool was used to find similar documents. This tool allows the analyst to find documents that are not necessarily directly related to the source document but share common features with it, which can lead to new events that would not be uncovered through direct means.
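A neighborhood query over a vector-space model can be sketched as a cosine-similarity ranking. The example assumes raw term-frequency vectors and a simple regular-expression tokenizer; the weighting scheme actually used by AW is not stated, and the function names and toy documents are hypothetical.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Build a term-frequency vector (a Counter) from lowercase word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def neighbors(source_id, docs, k=2):
    """Return the ids of the k documents most similar to the source document."""
    vectors = {doc_id: vectorize(text) for doc_id, text in docs.items()}
    source = vectors[source_id]
    scored = [(cosine(source, vec), doc_id)
              for doc_id, vec in vectors.items() if doc_id != source_id]
    # Highest similarity first; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical toy documents.
docs = {
    "a": "dead fish found in the river",
    "b": "fish die off reported along the river",
    "c": "council votes on parking fees",
}
nearest = neighbors("a", docs, k=1)
```

Because similarity is computed over shared vocabulary rather than shared entities, a query of this kind can surface documents that entity links alone would miss, which matches the role the tool plays in the workflow.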
When the various seed clusters were explored, it was determined that the most serious ongoing threat was the potential that a bioweapon had been deployed on the 15th. The documents dated after the 15th were then scanned looking for any events that could be fallout from that event. Here, the use of color coding in the browser to indicate which documents had been scanned already reduced the number of documents that needed to be considered and led to the discovery of the dead fish in the river and then the potentially related accident.
Finally, the workspace was rearranged by potential sequences of events and by the type of threat. In total, the leads were identified, followed up on, and schematized in approximately 2.5 hours. Approximately two additional hours were then spent scanning for other potential leads or evidence.
Figure 4. Final state of the workspace. Different clusters represent different threats or themes. The primary threat is in the lower right, surrounded by support, notes, and other leads pursued. The secondary threat is to the left. The rest of the workspace contains other leads that were pursued, with relevance/seriousness roughly falling away to the left.